- Raginsky, Maxim (Ed.) Learning operators between infinite-dimensional spaces is an important learning task arising in machine learning, imaging science, mathematical modeling and simulations, etc. This paper studies the nonparametric estimation of Lipschitz operators using deep neural networks. Non-asymptotic upper bounds are derived for the generalization error of the empirical risk minimizer over a properly chosen network class. Under the assumption that the target operator exhibits a low-dimensional structure, our error bounds decay as the training sample size increases, at an attractive fast rate that depends on the intrinsic dimension in our estimation. Our assumptions cover most scenarios in real applications, and our results give rise to fast rates by exploiting low-dimensional structures of data in operator estimation. We also investigate the influence of network structures (e.g., network width, depth, and sparsity) on the generalization error of the neural network estimator and give quantitative guidance on choosing network structures to maximize learning efficiency.
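As a rough illustration of the setting above (not the paper's estimator), the sketch below discretizes input and output functions on a grid and fits a fully connected network by empirical risk minimization; the grid size, architecture, synthetic "antiderivative" operator, and training budget are all illustrative assumptions.

```python
# Minimal sketch of operator learning by empirical risk minimization.
# All concrete choices (grid size, operator, network, optimizer) are assumptions
# made for illustration, not choices from the paper.
import torch

m = 64            # grid points used to discretize functions on [0, 1]
n_train = 2000    # training sample size
grid = torch.linspace(0.0, 1.0, m)

def sample_inputs(n):
    # random low-frequency input functions: a few sine modes with random coefficients
    k = torch.arange(1, 6, dtype=torch.float32).reshape(1, -1, 1)   # modes 1..5
    coef = torch.randn(n, 5, 1)
    return (coef * torch.sin(torch.pi * k * grid.reshape(1, 1, -1))).sum(dim=1)

def target_operator(u):
    # a simple Lipschitz operator: the antiderivative (cumulative integral) of u
    return torch.cumsum(u, dim=1) / m

x_train = sample_inputs(n_train)
y_train = target_operator(x_train)

# A plain fully connected network mapping one discretized function to another.
net = torch.nn.Sequential(
    torch.nn.Linear(m, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, m),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    loss = torch.mean((net(x_train) - y_train) ** 2)   # empirical risk: mean squared error on the grid
    loss.backward()
    opt.step()

# Estimate the generalization error on fresh samples.
x_test = sample_inputs(500)
with torch.no_grad():
    test_err = torch.mean((net(x_test) - target_operator(x_test)) ** 2)
print(f"final train loss {loss.item():.5f}, test error {test_err.item():.5f}")
```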
- Loh, Po-Ling; Raginsky, Maxim (Ed.)
- Loh, Po-Ling; Raginsky, Maxim (Ed.) Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two-layer neural network outside the kernel regime, by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e., of the form $$f^\star(x) = g(Ux)$$ where $$U: \mathbb{R}^d \to \mathbb{R}^r$$ with $$d \gg r$$. When the degree of $$f^\star$$ is $$p$$, it is known that $$n \asymp d^p$$ samples are necessary to learn $$f^\star$$ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to $$f^\star$$. This results in an improved sample complexity of $$n \asymp d^2$$ and enables transfer learning with sample complexity independent of $$d$$.
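The following sketch (my own illustration, not the authors' experiments; the dimensions, target $$g$$, and training hyperparameters are assumed and untuned) trains a two-layer ReLU network by plain gradient descent on a target of the form $$f^\star(x) = g(Ux)$$ and then measures how much of the first layer's weight mass lies in the relevant subspace spanned by the rows of $$U$$.

```python
# Sketch of representation learning outside the kernel regime.
# Dimensions, the degree-2 target g, and hyperparameters are illustrative assumptions.
import torch

d, r, n, width = 50, 2, 4096, 128

# r orthonormal "relevant" directions; the target depends on x only through Ux.
U = torch.linalg.qr(torch.randn(d, r)).Q.T          # shape (r, d)

def f_star(x):
    z = x @ U.T                                     # project onto the relevant subspace
    return z[:, 0] * z[:, 1]                        # a degree-2 polynomial g(z) = z1 * z2

x = torch.randn(n, d)
y = f_star(x)

# Two-layer ReLU network with trainable first and second layers.
W1 = (torch.randn(width, d) / d ** 0.5).requires_grad_()
b1 = torch.zeros(width, requires_grad=True)
w2 = (torch.randn(width) / width ** 0.5).requires_grad_()

lr = 0.05
for step in range(3000):
    pred = torch.relu(x @ W1.T + b1) @ w2
    loss = torch.mean((pred - y) ** 2)
    for p in (W1, b1, w2):
        p.grad = None
    loss.backward()
    with torch.no_grad():
        for p in (W1, b1, w2):
            p -= lr * p.grad

# How much of the first layer's weight mass lies in span(U)?  A ratio near 1
# indicates the network learned a representation aligned with the relevant directions.
with torch.no_grad():
    proj = W1 @ U.T @ U                             # component of each row inside span(U)
    ratio = (proj.norm() / W1.norm()) ** 2
print(f"final loss {loss.item():.4f}, fraction of W1 in span(U): {ratio.item():.2f}")
```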
- Loh, Po-Ling; Raginsky, Maxim (Ed.) We establish a connection between sampling and optimization on discrete domains. For a family of distributions $$\mu$$ defined on size-$$k$$ subsets of a ground set of elements that is closed under external fields, we show that rapid mixing of natural local random walks implies the existence of simple approximation algorithms to find $$\max \mu(\cdot)$$. More precisely, we show that if $$t$$-step down-up random walks have spectral gap at least inverse polynomially large, then $$t$$-step local search finds $$\max \mu(\cdot)$$ within a factor of $$k^{O(k)}$$. As the main application of our result, we show that $$2$$-step local search achieves a nearly-optimal $$k^{O(k)}$$-factor approximation for MAP inference on nonsymmetric $$k$$-DPPs. This is the first nontrivial multiplicative approximation algorithm for this problem. In our main technical result, we show that an exchange inequality, a concept rooted in discrete convex analysis, can be derived from fast mixing of local random walks. We further advance the state of the art on the mixing of random walks for nonsymmetric DPPs and, more generally, sector-stable distributions, by obtaining the tightest possible bound on the step size needed for polynomial-time mixing of random walks. We bring the step size down by a factor of $$2$$ compared to prior works, and consequently get a quadratic improvement in the runtime of local search steps; this improvement is potentially of independent interest in sampling applications.
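As a toy illustration of $$t$$-step local search for MAP inference on a nonsymmetric $$k$$-DPP (a sketch under an assumed kernel construction and problem sizes, not the paper's implementation), the code below maximizes $$\mu(S) \propto \det(L_S)$$ over size-$$k$$ subsets by repeatedly applying improving exchanges of at most $$t$$ elements.

```python
# Sketch of t-step local search for MAP inference on a nonsymmetric k-DPP.
# The random kernel and the sizes n, k, t are illustrative assumptions.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k, t = 30, 5, 2                                     # ground set size, subset size, exchange size

# A nonsymmetric kernel whose symmetric part is positive definite, so every
# principal minor det(L_S) is positive and mu(S) ~ det(L_S) defines a nonsymmetric DPP.
B = rng.standard_normal((n, n)) / np.sqrt(n)
C = rng.standard_normal((n, n))
L = B @ B.T + 0.1 * np.eye(n) + 0.5 * (C - C.T)

def mu(S):
    idx = np.array(sorted(S))
    return np.linalg.det(L[np.ix_(idx, idx)])

def improving_exchange(S):
    """Return a set obtained by exchanging at most t elements that increases mu, or None."""
    base = mu(S)
    rest = sorted(set(range(n)) - S)
    for swap in range(1, t + 1):
        for out_elems in itertools.combinations(sorted(S), swap):
            for in_elems in itertools.combinations(rest, swap):
                T = (S - set(out_elems)) | set(in_elems)
                if mu(T) > base:
                    return T
    return None

# t-step local search: start from a random size-k subset and apply improving
# exchanges until the current set is locally optimal.
S = set(rng.choice(n, size=k, replace=False).tolist())
while True:
    T = improving_exchange(S)
    if T is None:
        break
    S = T

print("locally optimal subset:", sorted(S), " mu(S) =", mu(S))
```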
- Loh, Po-Ling; Raginsky, Maxim (Ed.)